Conversation

@hhoikoo
Member

@hhoikoo hhoikoo commented Oct 31, 2025

resolves #6432 (BA-2851)

This change adds configuration for partitioning resources among agents, instead of every agent always seeing the full resource pool. Partitioning prevents unintended over-allocation that could crash kernels.

  • SHARED: allows all agents to see full resources (useful for stress testing). This is the same behavior as before.
  • AUTO_SPLIT: automatically divides resources equally among agents.
  • MANUAL: lets users specify exact per-agent allocations for all resources.

Single-agent deployments remain unaffected and retain access to all available hardware resources.

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@github-actions github-actions bot added size:XL 500~ LoC comp:agent Related to Agent component labels Oct 31, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from e6c1f4b to d84258e Compare October 31, 2025 01:30
@hhoikoo hhoikoo requested a review from Copilot October 31, 2025 01:30
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces resource isolation options for multi-agent setups, enabling multiple agents to run on the same physical host with controlled resource allocation. The implementation adds three allocation modes: SHARED (default, backward compatible), AUTO_SPLIT (automatic equal division), and MANUAL (explicit per-agent configuration).

Key changes:

  • Introduces ResourcePartitioner class to manage resource allocation across agents
  • Adds ResourceAllocationMode enum with SHARED, AUTO_SPLIT, and MANUAL modes
  • Implements validation logic to ensure consistent manual allocations across agents
  • Updates agent initialization to use resource partitioning

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 23 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/resources.py | Adds ResourcePartitioner class and changes abstract methods to raise NotImplementedError |
| src/ai/backend/agent/config/unified.py | Defines allocation modes, new config fields (allocated_cpu/mem/disk/devices), and validation logic |
| src/ai/backend/agent/agent.py | Integrates ResourcePartitioner into agent initialization and updates slot calculations |
| src/ai/backend/agent/server.py | Creates ResourcePartitioner instances per agent and adds resource reconciliation |
| src/ai/backend/agent/docker/agent.py | Adds resource_partitioner parameter to constructor |
| src/ai/backend/agent/kubernetes/agent.py | Adds resource_partitioner parameter to constructor |
| tests/agent/test_resource_allocation.py | Comprehensive unit tests for all three allocation modes |
| tests/agent/test_config_validation.py | Tests for config validation of allocation modes and device consistency |
| tests/agent/docker/test_agent.py | Updates test to pass ResourcePartitioner to agent |
| changes/6498.feature.md | Changelog entry |

@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from d84258e to c5114a9 Compare October 31, 2025 03:56
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9f12687 to fdee4b0 Compare November 3, 2025 01:05
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 310d847 to 3faac0f Compare November 4, 2025 06:02
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from fdee4b0 to 90f0702 Compare November 4, 2025 06:10
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 36824ac to 279e71b Compare November 4, 2025 06:30
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 90f0702 to e2b1902 Compare November 4, 2025 06:35
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from 280831f to db07080 Compare November 4, 2025 10:32
@github-actions github-actions bot added the comp:manager Related to Manager component label Nov 4, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 3 times, most recently from ce120ef to 13c7be6 Compare November 6, 2025 01:01
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from e2b1902 to 9c34302 Compare November 6, 2025 01:08
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 80ecc2c to 7462fd8 Compare November 6, 2025 01:10
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 9c34302 to 04a0f3a Compare November 6, 2025 01:47
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 7462fd8 to e17be1c Compare November 6, 2025 01:52
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 04a0f3a to 92c3bd7 Compare November 6, 2025 01:58
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from e17be1c to d936ce3 Compare November 6, 2025 01:59
@hhoikoo hhoikoo force-pushed the feat/BA-2753/multiple-agents branch from 92c3bd7 to f0f7510 Compare November 6, 2025 02:07
@github-actions github-actions bot added the comp:common Related to Common component label Nov 12, 2025
@hhoikoo hhoikoo removed the comp:manager Related to Manager component label Nov 12, 2025
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch from 66b563f to 89a0562 Compare November 13, 2025 08:01
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch 2 times, most recently from 2d01787 to 6f857be Compare November 13, 2025 09:49
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch from 159bd39 to 80a5685 Compare November 14, 2025 00:55
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from a8c60e7 to 4d718b6 Compare November 14, 2025 00:56
@hhoikoo hhoikoo force-pushed the feat/BA-3024/multi-agent-resources-config branch 2 times, most recently from b8cfb1b to 33552a1 Compare November 14, 2025 02:04
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 4d718b6 to c0102e8 Compare November 14, 2025 02:34

class SlotName(UserString):
    __slots__ = ("_parsed", "_device_name", "_major_type", "_minor_type")
    __match_args__ = ("device_name", "major_type", "minor_type")
Collaborator

Why was this added?

Member Author

It's not strictly necessary, but I wanted to use pattern matching here (https://github.com/lablup/backend.ai/pull/6498/files#diff-a4da2a344d73525736025bcd638112245de4a7225d6293d21a7e1e5152224ec8R675), and this class needs `__match_args__` for positional patterns in a match statement to work.

Comment on lines +712 to +747
    async def _load_resources(self) -> Mapping[DeviceName, AbstractComputePlugin]:
        local_config_dump = self.local_config.model_dump(by_alias=True)

        match self.local_config.agent_common.backend:
            case AgentBackend.DOCKER:
                from .docker.resources import load_resources as docker_load

                return await docker_load(self.etcd, local_config_dump)
            case AgentBackend.KUBERNETES:
                from .kubernetes.resources import load_resources as kubernetes_load

                return await kubernetes_load(self.etcd, local_config_dump)
            case AgentBackend.DUMMY:
                from .dummy.config import DEFAULT_CONFIG_PATH, dummy_local_config
                from .dummy.resources import load_resources as dummy_load

                raw_config, _ = read_from_file(DEFAULT_CONFIG_PATH, "dummy")
                dummy_config = dummy_local_config.check(raw_config)
                return await dummy_load(self.etcd, local_config_dump, dummy_config)

    async def _scan_available_resources(self) -> Mapping[SlotName, Decimal]:
        compute_device_types = {name: cctx.instance for name, cctx in self.computers.items()}

        match self.local_config.agent_common.backend:
            case AgentBackend.DOCKER:
                from .docker.resources import scan_available_resources as docker_scan

                return await docker_scan(compute_device_types)
            case AgentBackend.KUBERNETES:
                from .kubernetes.resources import scan_available_resources as kubernetes_scan

                return await kubernetes_scan(compute_device_types)
            case AgentBackend.DUMMY:
                from .dummy.resources import scan_available_resources as dummy_scan

                return await dummy_scan(compute_device_types)
Collaborator

In terms of extensibility, this change looks like a regression, and I don't think it is a good direction.
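
The review does not say which alternative the reviewer has in mind. One common way to avoid a central `match` that must be edited for every new backend is a registry that each backend module populates itself. The sketch below illustrates that general pattern only; `register_scanner`, `_SCANNERS`, the reduced `AgentBackend` enum, and the plain `str` slot names are all hypothetical and do not come from the PR.

```python
import asyncio
from decimal import Decimal
from enum import Enum
from typing import Awaitable, Callable, Mapping


class AgentBackend(Enum):  # reduced stand-in for the real enum
    DOCKER = "docker"
    DUMMY = "dummy"


Scanner = Callable[[], Awaitable[Mapping[str, Decimal]]]
_SCANNERS: dict[AgentBackend, Scanner] = {}


def register_scanner(backend: AgentBackend) -> Callable[[Scanner], Scanner]:
    """Decorator that lets each backend module register its own scanner."""

    def decorator(fn: Scanner) -> Scanner:
        _SCANNERS[backend] = fn
        return fn

    return decorator


@register_scanner(AgentBackend.DUMMY)
async def scan_dummy() -> Mapping[str, Decimal]:
    # Placeholder values; a real scanner would probe the host.
    return {"cpu": Decimal(4)}


async def scan_available_resources(backend: AgentBackend) -> Mapping[str, Decimal]:
    # Dispatch through the registry instead of a hard-coded match statement.
    try:
        scanner = _SCANNERS[backend]
    except KeyError:
        raise NotImplementedError(f"no scanner registered for {backend.name}") from None
    return await scanner()
```

In this arrangement, adding a backend means adding a decorated function in that backend's module; the dispatching function stays unchanged.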

@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from c0102e8 to 73bf8f3 Compare November 14, 2025 04:31
Base automatically changed from feat/BA-3024/multi-agent-resources-config to feat/BA-2753/multiple-agents November 14, 2025 04:49
Base automatically changed from feat/BA-2753/multiple-agents to refactor/BA-3028 November 14, 2025 04:49
Base automatically changed from refactor/BA-3028 to feat/BA-3023/multi-agent-etcd November 14, 2025 04:49
Base automatically changed from feat/BA-3023/multi-agent-etcd to feat/BA-3026/agent-runtime November 14, 2025 04:49
Base automatically changed from feat/BA-3026/agent-runtime to feat/BA-2752/multiple-agents-config November 14, 2025 04:50
Base automatically changed from feat/BA-2752/multiple-agents-config to feat/BA-2750/config-table-syntax November 14, 2025 04:50
@hhoikoo hhoikoo changed the base branch from feat/BA-2750/config-table-syntax to main November 14, 2025 05:18
This change implements configuration for partitioning resources.

SHARED mode allows all agents to see the full resource pool (useful
for stress testing). This is the same behavior as before.
AUTO_SPLIT mode automatically divides resources equally among agents.
MANUAL mode lets users specify exact per-agent allocations for all
resources.

Single-agent deployments remain unaffected and retain access to all
available hardware resources.

This change modifies the semantics of ResourcePartitioner so that it now
takes ownership of the devices and injects partitioned devices into
individual agents after initialization.

This change fixes a bug in resource splitting, where reserved
resources were accidentally included in the total allocated to each
agent. The handling of total slots was flawed: reserved resources were
calculated from the perspective of a single agent without properly
accounting for server-level reserved resources. The fix inverts the
condition so that reserved resources are deducted only where they are
needed.
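
The commit message does not spell out the corrected formula. As a hedged illustration of the invariant it appears to aim for, the sketch below subtracts reserved amounts once at the host level and only then divides the remainder, so the per-agent totals add up to the host total minus the reservation. `split_after_reservation` and the slot names are hypothetical.

```python
from decimal import Decimal
from typing import Mapping


def split_after_reservation(
    total: Mapping[str, Decimal],
    reserved: Mapping[str, Decimal],
    num_agents: int,
) -> list[dict[str, Decimal]]:
    """Deduct host-level reservations once, then split the rest equally."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    # Reserved amounts are removed before splitting, so no agent's share
    # contains resources that were set aside for the host.
    allocatable = {
        slot: amount - reserved.get(slot, Decimal(0))
        for slot, amount in total.items()
    }
    share = {slot: amount / num_agents for slot, amount in allocatable.items()}
    return [dict(share) for _ in range(num_agents)]
```

For example, with 64 CPU slots, 4 reserved, and 3 agents, each agent in this sketch receives (64 - 4) / 3 = 20 slots; splitting the unreduced total instead would hand out the reserved slots as well.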
@hhoikoo hhoikoo force-pushed the feat/BA-2851/multi-agent-resources branch from 73bf8f3 to 2c41c25 Compare November 14, 2025 05:32
@hhoikoo hhoikoo closed this Nov 14, 2025
@hhoikoo
Member Author

hhoikoo commented Nov 14, 2025

Will create a new PR

@hhoikoo hhoikoo deleted the feat/BA-2851/multi-agent-resources branch November 14, 2025 09:47
Labels

comp:agent Related to Agent component comp:common Related to Common component size:XL 500~ LoC

Development

Successfully merging this pull request may close these issues.

Add resource partitioning for agents within the same agent runtime

3 participants